This work comes from following along with the R for Data Science book. The book is freely available online at https://r4ds.had.co.nz/
# import the tidyverse library
library(tidyverse)
# update packages
# tidyverse_update()
# load other packages required for the book
library(nycflights13)
# load ggplot
library(ggplot2)
# used in chapter 7
library(ggstance)
library(lvplot)
Use the mpg data frame and visualization techniques to answer this question: Do cars with big engines use more fuel than cars with small engines?
The mpg dataframe is included in the ggplot2 package. It contains data on 38 car models, collected by the US Environmental Protection Agency.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manual~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manual~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto(a~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto(l~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manual~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto(a~ f 18 27 p comp~
## 8 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p comp~
## 9 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
## 10 audi a4 quat~ 2 2008 4 manual~ 4 20 28 p comp~
## # ... with 224 more rows
We can use ?mpg to get more information on the dataframe.
The dataframe has 234 rows and 11 variables.
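To see all 11 variables with their types in one view, we could use glimpse() from the tidyverse (a sketch; str(mpg) gives a similar view):

```r
library(tidyverse)

# transposed preview of mpg: one line per variable with its type and first values
glimpse(mpg)
```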
Plot displ on the x-axis and hwy on the y-axis.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Notice the negative relationship between engine size (displ) and highway fuel efficiency (hwy), indicating that cars with big engines use more fuel.
Not in book: Let’s try this with city mileage as well.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = cty))
Looks pretty much the same as highway mileage.
The call to ggplot() creates a coordinate system that we’ll add layers to. We connect a dataset to the plot using the data argument. On its own, this creates an empty plot.
ggplot(data = mpg)
Just a blank figure.
# rows in mpg
print(paste("rows =", nrow(mpg)))
## [1] "rows = 234"
# columns in mpg
print(paste("columns =", ncol(mpg)))
## [1] "columns = 11"
drv describes the type of drive train in the vehicle.
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
This is not useful because both of these variables are categorical. We’re not gaining much information by displaying the data as a scatter plot.
Mapping aesthetics in your plot to variables in the data can help to illuminate interesting patterns. Let’s map the class variable onto a plot of displ and hwy to see how the classes distribute.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
We’ve mapped the colors of the points to the class variable. The 2seater class cars seem to be outliers, with large engines and good hwy mileage. This makes sense because these sports cars have large engines in small bodies.
We can also map variables to the size aesthetic. It’s best not to map size to a categorical variable.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
We also have the alpha or shape aesthetics.
# Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
# Right
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
The shape aesthetic can only map six categories at a time; the remaining groups are dropped from the plot.
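As the warning suggests, one workaround (a sketch, not from the book) is to supply one shape per class manually with scale_shape_manual():

```r
library(ggplot2)

# seven shape codes, one per class, so no group is dropped
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 1:7)
p
```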
You can also set the aesthetic properties manually. When the aesthetic is set manually (and not depending on a variable), it must be set OUTSIDE of the aes call.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = 'blue')
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
This does not turn the points blue: the color is being set manually, so it needs to go outside of the aes() call. Inside aes(), "blue" is treated as a data value and mapped to the default color scale.
The categorical variables are manufacturer, model, trans, drv, fl, and class.
The continuous variables are displ, cty, and hwy.
When we look at the mpg dataframe, we can check the type of data in each column. chr variables are categorical and num variables are continuous. int variables are technically neither categorical nor continuous; treating them as categorical would discard information, so we might want to consider them ordinal.
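One quick way to check the type of each column directly (a sketch):

```r
library(ggplot2)

# class of every column in mpg: character, numeric (dbl), or integer
sapply(mpg, class)
```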
Continuous Case:
# map color
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ))
# map size
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = displ))
# map shape - doesn't work
# ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy, shape = displ))
Categorical Case:
# map color
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv))
# map size
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = drv))
## Warning: Using size for a discrete variable is not advised.
# map shape
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = drv))
Shape doesn’t map with a continuous variable. Color takes on a gradient.
Size doesn’t work well with a categorical variable.
# map color and size
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ, size = displ))
We get both of the aesthetics on the same plot.
# map color
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, stroke = 5))
The stroke aesthetic modifies the width of the border for each point.
# map color
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
We can also compute an aesthetic from the data with an expression, such as displ < 5.
Facets allow us to split plots into multiple subplots. We can facet the plot by a single variable using facet_wrap(). The first argument of facet_wrap() is a formula which specifies the variable to split the plot on.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
We can facet the plot on the combination of two variables as well, using facet_grid().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
To facet in only one dimension, avoiding either rows or columns, we can use . instead of a variable name.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
color = class)) +
facet_grid(. ~ class)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cty)
Creating the facet using a continuous variable splits up the graph into each of the continuous values, resulting in a terrible graph.
What do the empty cells in a plot with facet_grid(drv ~ cyl) mean?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
The empty plots are facets that did not have any corresponding data. For instance, rear-wheel-drive did not have any cars with engine displacement (displ) of 4 or 5 litres.
How do they relate to this plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
We see that we’re missing points at each of the variable combinations that aren’t present in our dataset.
What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
We create facets using just one variable. The . retains the rows or columns so that we don’t tile in that direction.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
We can look at the color aesthetic here to compare.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
One thing we can see immediately is that the overlapping points make it harder to focus on a single group when we use the color aesthetic compared to faceting. If we’re interested in comparing points and looking for patterns within groups, then faceting is more useful here.
A disadvantage of using facets is that we lose some ability to compare across groups. With the color aesthetic, we immediately see how groups are mixed into one another, but this is less obvious with the facets.
Depending on how many groups we’re comparing, each of these aesthetics will change as well. If we have too many groups, facets will quickly become overwhelming to look at. In larger datasets with fewer groups, we might prefer to separate groups using facets so that we have less density per plot.
Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
The facet_wrap() function wraps a sequence of panels into a 2-dimensional figure. The nrow and ncol arguments allow you to define the number of rows and columns for the wrapped figure.
The as.table option controls the order of the panels in the output. If as.table is TRUE (the default), the highest values are at the bottom-right; otherwise, the highest values are at the top-right.
The dir option specifies whether the direction should be “h” for horizontal (default) or “v” for vertical.
facet_grid() does not have the nrow or ncol arguments because the rows and cols arguments define the grouping variables, and those variables’ levels determine the number of rows and columns of the grid.
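A quick sketch of these layout options together:

```r
library(ggplot2)

# three columns, panels filled top-to-bottom ("v"), ordering reversed
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 3, dir = "v", as.table = FALSE)
p
```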
When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
This creates a wide landscape figure rather than a tall portrait one, which is usually easier to read.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(year ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(cyl ~ year)
The graphs are clearer and easier to read when we put the variable with more levels into the columns.
Plots can represent the same data with different visual objects. These are referred to as geoms. Take the following two graphs for example.
# scatter plot
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# line plot
ggplot(data=mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Every geom in ggplot2 takes a mapping argument, however not every aesthetic can work with every geom. For instance, you can set the shape of a point, but not of a line.
Linetype can be set for a line.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The drv variable describes the car’s drivetrain, which can be 4-wheel drive, front-wheel drive, or rear-wheel drive. We’ve set the aesthetic to separate the data by drivetrain, and the linetype reflects each group.
We can see this even more clearly by overlapping multiple geoms on one plot.
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(linetype = drv, color = drv)) +
geom_point(mapping = aes(color = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here are some resources for the different geoms available for ggplot2:
Geoms like geom_smooth() use a single object to represent the data. We can set the group aesthetic to split the object into multiple groups. The convenient thing about the group feature is that a legend is not added automatically to our plot.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can also manually remove the legend from aesthetics that do generate one, such as the color aesthetic.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy,
group = drv, color = drv),
show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
To include multiple geoms in the same plot, we just add them to the same ggplot() function.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
By writing the code like this, we duplicate the \(x\) and \(y\) axis assignments. This could make things difficult when updating these axes to different variables. Instead, we can move these assignments into the ggplot() function, where they serve as global mappings.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can still extend or overwrite the global mappings by placing them directly in the geom function.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
You can use this to specify different data for the geoms as well. Here, we’ll subset the smooth line to just the subcompact cars.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == 'subcompact'),
se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
geom_line(), geom_boxplot(), geom_histogram(), and geom_area(), respectively.
Prediction: a scatter plot of displ against hwy, colored by drv, with smooth lines also separated by drv and without standard errors.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
The show.legend flag indicates whether a legend should be included in the plot. It was probably used earlier because the legend only pertained to one of the plots.
What does the se argument to geom_smooth() do?
The se argument indicates whether the standard-error band around the smooth line should be displayed.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Will these two graphs look different?
Prediction: No, because the first plot simply sets global aesthetics, while the second plot sets the same aesthetics locally.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Top left graph:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=4) +
geom_smooth(se = FALSE, size=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Top right plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=4) +
geom_smooth(se = FALSE, size=2, mapping = aes(group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Middle left plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point(size=4) +
geom_smooth(se = FALSE, size=2, mapping = aes(group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Middle right plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=4, mapping = aes(color = drv)) +
geom_smooth(se = FALSE, size=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Bottom left plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=4, mapping = aes(color = drv)) +
geom_smooth(se = FALSE, size=2, mapping = aes(linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Bottom right plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=6, color = 'gray97') +
geom_point(size=3, mapping = aes(color = drv))
Bar charts are a useful way to look at data. Here’s a bar chart using the diamonds dataset, grouped by cut. This dataset contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. We can see from the plot that more diamonds are available with high-quality cuts than with low-quality cuts.
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
ggplot(data = diamonds) +
geom_bar(mapping = aes(x=cut))
The \(y\)-axis variable, count, is not in the diamonds dataset. Some graphs compute new values to plot.
The algorithm used to produce the transformation is called a stat (i.e. statistical transformation). You can find out which stat a geom uses by inspecting its default stat argument. For example, ?geom_bar shows that the default stat is “count”, which means that it uses the stat_count() function.
You can generally use stats and geoms interchangeably. This produces the same graph as before:
ggplot(data=diamonds) +
stat_count(mapping = aes(x=cut))
Every geom has a default stat and every stat has a default geom.
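We can see the default stat directly in a geom function’s argument list; for instance (a sketch):

```r
library(ggplot2)

# the stat argument's default value is the geom's default stat
formals(geom_bar)$stat        # "count"
formals(geom_histogram)$stat  # "bin"
formals(geom_point)$stat      # "identity"
```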
There are three reasons why you might want to use a stat explicitly:
- To override the default stat, e.g. stat = "identity" to map bar heights to raw values in the data.
- To override the default mapping from transformed variables to aesthetics, e.g. displaying proportions rather than counts.
- To draw attention to the statistical transformation in your code, e.g. with stat_summary().
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
ggplot(data=diamonds) +
geom_bar(mapping = aes(x=cut, y=stat(prop), group=1))
We can use stat_summary() to summarize the y values for each unique x value, drawing attention to the summary we’re computing.
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
?stat_summary
The default geom associated with stat_summary() is the pointrange geom.
?geom_pointrange()
# create a df with the values first
diamonds_summary <- diamonds %>%
group_by(cut) %>%
summarize(lower = min(depth), med = median(depth), upper = max(depth))
ggplot(data = diamonds_summary) +
geom_pointrange(mapping = aes(x = cut, y = med, ymin = lower, ymax = upper)) +
ylab("depth")
What does geom_col() do? How is it different to geom_bar()?
?geom_col
Both geom_col() and geom_bar() create bar charts. With geom_col(), bar heights represent values in the data: we specify variables for both the x- and y-axes, and each bar’s height is the y variable’s value (summed, if a category appears more than once) for that category on the x-axis.
With geom_bar(), bar height is proportional to the number of cases in each group: we specify only the x-axis variable, and the bar heights are computed as the count for each category.
We can look at examples of the two here.
# bar height is the count in each group
ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut))
# bar height is the sum of depth values in each group
ggplot(data=diamonds) +
geom_col(mapping=aes(x=cut, y=depth))
See the ggplot2 cheat sheet for the available geoms: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
What variables does stat_smooth() compute? What parameters control its behaviour?
The stat_smooth() function adds a smooth line through the data points, computing the predicted value (y) along with the lower and upper confidence bounds (ymin, ymax) and the standard error (se).
The method argument chooses how the smooth line is generated, with smoothing methods including “lm”, “glm”, “gam”, “loess”, or a custom function. The formula argument allows you to enter a smoothing formula such as y ~ x. The se argument is a boolean controlling whether the standard-error band is shown.
ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Width, color=Species)) +
geom_point() +
geom_smooth(se=TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group=1))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop.., group=1))
Without the group argument, the geom groups the data by the x-axis variable, cut, so each bar’s proportion is computed within its own group and equals 1. Since we are interested in proportions, we need to consider all of the data together as a single group. Notice how the fill aesthetic no longer distinguishes the subgroups once we add the group argument.
We can adjust colors in bar charts using either the color or fill aesthetic.
ggplot(data=diamonds) +
geom_bar(mapping = aes(x=cut, color=cut))
ggplot(data=diamonds) +
geom_bar(mapping = aes(x=cut, fill=cut))
We can even map the color to another variable. The bars will reflect the data and the colors will be stacked to represent the aesthetic variable.
ggplot(data=diamonds) +
geom_bar(mapping = aes(x=cut, fill=clarity))
The stacking is done automatically by the position adjustment specified by the position argument. If you don’t want a stacked barchart, you can use one of the three other options:
"identity""dodge""fill"The "identity" option places each object exactly where it falls in the context of the graph. For bar charts, this simply overlaps the bars. To see the overlapping bars, we need to adjust one of the other aesthetics, like alpha.
# set the transparency of the fill using alpha
ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
geom_bar(alpha = 1/5, position = "identity")
# remove the fill by setting fill to NA
ggplot(data=diamonds, mapping=aes(x=cut, color=clarity)) +
geom_bar(fill=NA, position="identity")
The "fill" option works like stacking but makes each stack of bars the same height. This makes comparing proportions easier.
ggplot(data=diamonds) +
geom_bar(mapping = aes(x=cut, fill=clarity), position="fill")
The "dodge" option places overlapping objects beside one another. This makes for easier comparisons between individual values.
ggplot(data=diamonds) +
geom_bar(mapping = aes(x=cut, fill=clarity), position="dodge")
One more option, not useful for bar charts, can be used with scatterplots: the "jitter" position, which allows us to see overlapping points on a scatterplot. Notice in the following plot that we only see 126 points even though there are 234 observations.
ggplot(data=mpg) +
geom_point(mapping = aes(x=displ, y=hwy))
We can avoid this with the "jitter" option, which adds a small amount of random noise to the points on the plot.
ggplot(data=mpg) +
geom_point(mapping = aes(x=displ, y=hwy), position="jitter")
This can really help display the data on a plot. Another way to jitter points on a scatter plot is to use the geom_jitter() geom.
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_point()
We might want to jitter the points to view more.
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_jitter()
Now we can see much more data on the graph.
What arguments to geom_jitter() control the amount of jittering?
The width and height parameters control the amount of horizontal and vertical jitter. By default, each adds noise equal to 40% of the resolution of the data. We can play around with these settings.
# more horizontal jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_jitter(width=0.7)
# more vertical jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_jitter(height=0.7)
# less horizontal jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_jitter(width=0.1)
# less vertical jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_jitter(height=0.1)
Compare and contrast geom_jitter() with geom_count().
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_jitter()
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
geom_count()
The two geoms are quite similar. With geom_jitter() we look at the density of points to see which areas are highly represented, while with geom_count() we look at the size of the points instead. The benefit of geom_count() is that the points keep their original positions, giving a more accurate representation of the data; however, the true number of points is still somewhat hidden.
What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.
The default position of geom_boxplot() is “dodge2”. We can look at the class variable colored by drv to see this.
ggplot(data=mpg) +
geom_boxplot(mapping = aes(x=class, y=hwy, color=drv))
We can try playing around with other positions as well.
ggplot(data=mpg) +
geom_boxplot(mapping = aes(x=class, y=hwy, fill=drv), alpha=0.2, position="identity")
The default coordinate system is the Cartesian coordinate system where the \(x\) and \(y\) coordinates act independently to determine the location of each point. There are other coordinate systems that we can use.
coord_flip() switches the \(x\) and \(y\) axes. This comes in handy for making things like horizontal box plots.
# vertical boxplots
ggplot(data = mpg, mapping=aes(x=class, y=hwy, fill=class)) +
geom_boxplot(show.legend = FALSE)
# horizontal boxplots
ggplot(data=mpg, mapping=aes(x=class, y=hwy, fill=class)) +
geom_boxplot(show.legend = FALSE) +
coord_flip()
coord_quickmap() sets the aspect ratio correctly for maps. This is important for plotting spatial data.
map <- map_data("world")
ggplot(map, aes(long, lat, group=group, fill=region)) +
geom_polygon(color="black", show.legend=FALSE)
ggplot(map, aes(long, lat, group=group, fill=region)) +
geom_polygon(color="black", show.legend=FALSE) +
coord_quickmap()
coord_polar() uses polar coordinates. This can reveal interesting connections between a bar chart and Coxcomb chart.
bar <- ggplot(data=diamonds) +
geom_bar(aes(x=cut, fill=cut),
show.legend = FALSE,
width = 1) +
theme(aspect.ratio=1) +
labs(x=NULL, y=NULL)
bar + coord_flip()
bar + coord_polar()
Turn a stacked bar chart into a pie chart using coord_polar().
# without polar coordinates
ggplot(data=diamonds, aes(x = factor(1), fill=cut)) +
geom_bar(width=1)
# with polar coordinates
ggplot(data=diamonds, aes(x = factor(1), fill=cut)) +
geom_bar(width=1) +
coord_polar(theta="y") +
labs(x=NULL, y=NULL)
What does labs() do? Read the documentation.
labs() allows you to add labels to your plot. You can add a title, subtitle, caption, and tag, and update the x- and y-axis labels.
ggplot(data=diamonds, aes(x = factor(1), fill=cut)) +
geom_bar(width=1) +
coord_polar(theta="y") +
labs(x=NULL, y=NULL,
title="Proportion of Diamond Cuts",
subtitle="A pie chart",
caption="This pie chart shows us the proportion of diamonds with different cuts in the diamonds dataset.
Most diamonds are cut with the Ideal cut, followed by Premium.",
tag="diamonds dataset"
)
What’s the difference between coord_quickmap() and coord_map()?
coord_map() projects a portion of the earth onto a flat 2D plane and does not typically preserve straight lines; it requires a lot of computation. coord_quickmap() approximates the projection and does preserve straight lines. It works best for smaller areas closer to the equator.
Why is coord_fixed() important? What does geom_abline() do?
ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
This plot tells us that there is a positive relationship between the city miles and highway miles a car can drive per gallon of gas. Since the points are all above the line, we see that a car always gets more highway miles than city miles.
geom_abline() plots the diagonal reference line where city mileage equals highway mileage, and coord_fixed() fixes the aspect ratio at 1 so that this 45-degree line is not distorted.
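geom_abline() defaults to a line with intercept 0 and slope 1; we could make that explicit, or draw any other reference line (a sketch):

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +  # the y = x line
  coord_fixed()  # one unit on x spans the same distance as one unit on y
p
```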
Now we can develop a code template that uses all of the techniques we’ve learned to generate a plot with ggplot.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
The seven parameters in this template compose the grammar of graphics, a formal system for building plots. This grammar is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
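As a check on the template, here it is filled in with concrete values (a sketch; every argument other than the data and mappings falls back to a sensible default when omitted):

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    stat = "count",       # the default stat for geom_bar
    position = "dodge"    # side-by-side bars instead of stacking
  ) +
  coord_flip() +          # coordinate function
  facet_wrap(~ clarity)   # faceting scheme
p
```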
This chapter is pretty basic stuff so I’m going to skip it. See the R For Data Science book for the details: https://r4ds.had.co.nz/workflow-basics.html.
These are the topics covered in this chapter:
- 4.1 coding basics
- 4.2 variable naming conventions
- 4.3 functions
We’ll be using the flights dataframe from the nycflights13 package. This dataframe contains all 336,776 flights that departed from New York City in 2013. This data frame comes from the Bureau of Transportation Statistics.
head(flights)
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
The data is actually in tibble form. A tibble is a dataframe that is slightly tweaked to work better in the tidyverse. More details on tibbles will come in the wrangle section.
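For a quick taste (a sketch): a tibble prints more compactly than a plain data frame and can be created from one with as_tibble().

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)  # convert a plain data frame to a tibble
class(tb)            # a tibble is still a data.frame underneath
tb                   # prints only what fits, with column types shown
```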
This chapter will introduce the five key dplyr functions that will allow us to solve the vast majority of data manipulation challenges.
- filter()
- arrange()
- select()
- mutate()
- summarize()
These can all be used with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly: the first argument is a data frame, the subsequent arguments describe what to do with it using unquoted variable names, and the result is a new data frame.
filter() allows you to subset observations based on their values. The first argument is the dataframe; the second and subsequent arguments are the expressions that filter the dataframe.
For example, we can select all the flights on January 1st with:
filter(flights, month==1, day==1)
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
We can store the filtered dataframe in a variable.
jan1 <- filter(flights, month==1, day==1)
We can use logical operators with filter() to build more complex conditions. Say we wanted to get all flights in January or December. We could do
filter(flights, month==1 | month==12)
## # A tibble: 55,139 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 55,129 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
A useful short-hand for this is x %in% y. This selects every row where \(x\) is one of the values in \(y\). We can rewrite the code above as
jan_dec <- filter(flights, month %in% c(1, 12))
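We can confirm that the two forms select exactly the same rows (a sketch, using November and December as an example):

```r
library(dplyr)
library(nycflights13)

# the | form and the %in% form produce identical results
a <- filter(flights, month == 11 | month == 12)
b <- filter(flights, month %in% c(11, 12))
identical(a, b)  # TRUE
```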
filter() only includes rows where the condition is true. If you want to keep NA values then ask for them explicitly.
# create a tibble
df <- tibble(x = c(1, NA, 3))
# doesn't return NA
filter(df, x>1)
## # A tibble: 1 x 1
## x
## <dbl>
## 1 3
# returns NA rows
filter(df, is.na(x) | x>1)
## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3
filter(flights, arr_delay >= 2)
## # A tibble: 127,929 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 558 -4 740 728
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 559 600 -1 941 910
## 9 2013 1 1 600 600 0 837 825
## 10 2013 1 1 602 605 -3 821 805
## # ... with 127,919 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
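If the goal was flights with an arrival delay of two or more hours (the usual phrasing of this challenge), note that arr_delay is recorded in minutes, so the threshold should be 120 rather than 2:

```r
library(dplyr)
library(nycflights13)

# arrival delay of two or more hours (arr_delay is in minutes)
filter(flights, arr_delay >= 120)
```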
filter(flights, dest=='HOU' | dest=='IAH')
## # A tibble: 9,313 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 623 627 -4 933 932
## 4 2013 1 1 728 732 -4 1041 1038
## 5 2013 1 1 739 739 0 1104 1038
## 6 2013 1 1 908 908 0 1228 1219
## 7 2013 1 1 1028 1026 2 1350 1339
## 8 2013 1 1 1044 1045 -1 1352 1351
## 9 2013 1 1 1114 900 134 1447 1222
## 10 2013 1 1 1205 1200 5 1503 1505
## # ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# find out abbreviations for airlines
?flights
# airlines are stored in the airlines tibble
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
# filter data frame with UA, AA, DL
filter(flights, carrier %in% c('UA', 'AA', 'DL'))
## # A tibble: 139,504 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 600 -1 941 910
## 10 2013 1 1 559 600 -1 854 902
## # ... with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, month %in% c(7,8,9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(flights, dep_delay<=0, arr_delay>2)
## # A tibble: 34,583 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 554 558 -4 740 728
## 2 2013 1 1 555 600 -5 913 854
## 3 2013 1 1 558 600 -2 753 745
## 4 2013 1 1 558 600 -2 924 917
## 5 2013 1 1 559 600 -1 941 910
## 6 2013 1 1 600 600 0 837 825
## 7 2013 1 1 602 605 -3 821 805
## 8 2013 1 1 622 630 -8 1017 1014
## 9 2013 1 1 624 630 -6 909 840
## 10 2013 1 1 624 630 -6 840 830
## # ... with 34,573 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
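If this challenge was "arrived more than two hours late, but didn't leave late", the arrival threshold should likewise be expressed in minutes:

```r
library(dplyr)
library(nycflights13)

# departed on time or early, arrived more than two hours late
filter(flights, dep_delay <= 0, arr_delay > 120)
```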
filter(flights, dep_delay>=1, dep_delay-arr_delay >= 30)
## # A tibble: 8,974 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 701 700 1 1123 1154
## 2 2013 1 1 857 851 6 1157 1222
## 3 2013 1 1 909 810 59 1331 1315
## 4 2013 1 1 1025 951 34 1258 1302
## 5 2013 1 1 1625 1550 35 2054 2050
## 6 2013 1 1 1716 1545 91 2140 2039
## 7 2013 1 1 1900 1845 15 2212 2227
## 8 2013 1 1 1957 1945 12 2307 2329
## 9 2013 1 1 2035 2030 5 2337 5
## 10 2013 1 1 2046 2035 11 2144 2213
## # ... with 8,964 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
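Similarly, for "delayed by at least an hour, but made up over 30 minutes in flight", the departure threshold should be 60 minutes rather than 1:

```r
library(dplyr)
library(nycflights13)

# left at least an hour late but made up more than 30 minutes in the air
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
```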
filter(flights, dep_time <= 600)
## # A tibble: 9,344 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 9,334 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
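One subtlety for "departed between midnight and 6am (inclusive)": midnight is coded as 2400 in dep_time, so dep_time <= 600 misses those flights. A version that includes midnight departures:

```r
library(dplyr)
library(nycflights13)

# departures between midnight (coded as 2400) and 6am, inclusive
filter(flights, dep_time == 2400 | dep_time <= 600)
```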
Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

between(x, left, right) is a shortcut for x >= left & x <= right. We can definitely shorten part 4: departed in summer (July through September).
# challenge 4
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The other parts are already about as simple as they can be.
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

(no_dep <- filter(flights, is.na(dep_time)))
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 NA 1630 NA NA 1815
## 2 2013 1 1 NA 1935 NA NA 2240
## 3 2013 1 1 NA 1500 NA NA 1825
## 4 2013 1 1 NA 600 NA NA 901
## 5 2013 1 2 NA 1540 NA NA 1747
## 6 2013 1 2 NA 1620 NA NA 1746
## 7 2013 1 2 NA 1355 NA NA 1459
## 8 2013 1 2 NA 1420 NA NA 1644
## 9 2013 1 2 NA 1321 NA NA 1536
## 10 2013 1 2 NA 1545 NA NA 1910
## # ... with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
There are 8,255 flights with a missing dep_time. These flights are also missing dep_delay, arr_time, arr_delay, and air_time. They are likely flights that were canceled and never took off.
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

NA^0 # = 1
## [1] 1
NA | TRUE # TRUE
## [1] TRUE
FALSE & NA # FALSE
## [1] FALSE
NA * 0 # NA
## [1] NA
The first three expressions have an answer because the result is the same no matter what value the NA stands for. In NA^0, any value raised to the power of zero is 1, so even if the NA were infinity the answer would still be 1. In NA | TRUE, only one side needs to be TRUE for the whole expression to be TRUE, so the value of the NA doesn't matter. In FALSE & NA, both sides must be TRUE, and one side is already FALSE, so the expression is FALSE regardless of the NA. Finally, NA * 0 stays missing: zero times a finite number is 0, but if the NA were infinity, the answer would not be zero.
Inf * 0
## [1] NaN
This means R cannot determine an answer for NA * 0, so the result stays NA.
arrange()

arrange() works similarly to filter() except that it changes the order of the rows instead of selecting them. It takes a data frame and a set of column names to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns.
arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Use desc() to re-order a column in descending order.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Missing values are always sorted at the end.
df <- tibble(x = c(5,2,NA))
arrange(df, x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2
## 2 5
## 3 NA
arrange(df, desc(x))
## # A tibble: 3 x 1
## x
## <dbl>
## 1 5
## 2 2
## 3 NA
How could you use arrange() to sort all missing values to the start? (Hint: use is.na())

df <- tibble(x = c(1, NA, 2, 3, NA, 4, 5, NA))
# sort in descending order, with is.na()
arrange(df, desc(is.na(x)))
## # A tibble: 8 x 1
## x
## <dbl>
## 1 NA
## 2 NA
## 3 NA
## 4 1
## 5 2
## 6 3
## 7 4
## 8 5
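Note that sorting on desc(is.na(x)) only moves the NA values to the front; the non-missing values keep their original order. To also sort the non-missing values, add x as a secondary sort key (shown here on a shuffled vector so the effect is visible):

```r
library(dplyr)

df <- tibble(x = c(3, NA, 1, 2))

# NAs first, then the remaining values in ascending order
arrange(df, desc(is.na(x)), x)
```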
Sort flights to find the most delayed flights. Find the flights that left earliest.

# sort the departure delays in descending order
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
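The output above answers only the first half. For the flights that left earliest relative to schedule, sort dep_delay in ascending order so the most-negative delays come first:

```r
library(dplyr)
library(nycflights13)

# smallest (most negative) departure delays first, i.e. the flights
# that left furthest ahead of their scheduled time
arrange(flights, dep_delay)
```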
Sort flights to find the fastest (highest speed) flights.

# sort by speed: distance traveled divided by time spent in the air
arrange(flights, distance/air_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 28 1917 1825 52 2118 1935
## 2 2013 6 29 755 800 -5 1035 909
## 3 2013 8 28 932 940 -8 1116 1051
## 4 2013 1 30 1037 955 42 1221 1100
## 5 2013 11 27 556 600 -4 727 658
## 6 2013 5 21 558 600 -2 721 657
## 7 2013 12 9 1540 1535 5 1720 1656
## 8 2013 6 10 1356 1300 56 1646 1414
## 9 2013 7 28 1322 1325 -3 1612 1432
## 10 2013 4 11 1349 1345 4 1542 1453
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
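Note that the ascending sort above actually puts the slowest flights first. To surface the fastest (highest-speed) flights, wrap the speed expression in desc():

```r
library(dplyr)
library(nycflights13)

# highest speed (distance per minute of air time) first
arrange(flights, desc(distance / air_time))
```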
Which flights traveled the farthest? Which traveled the shortest?

# sort by distance for shortest flights
arrange(flights, distance)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 27 NA 106 NA NA 245
## 2 2013 1 3 2127 2129 -2 2222 2224
## 3 2013 1 4 1240 1200 40 1333 1306
## 4 2013 1 4 1829 1615 134 1937 1721
## 5 2013 1 4 2128 2129 -1 2218 2224
## 6 2013 1 5 1155 1200 -5 1241 1306
## 7 2013 1 6 2125 2129 -4 2224 2224
## 8 2013 1 7 2124 2129 -5 2212 2224
## 9 2013 1 8 2127 2130 -3 2304 2225
## 10 2013 1 9 2126 2129 -3 2217 2224
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# sort by descending distance for farthest flights
arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 857 900 -3 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 -2 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## 7 2013 1 7 1042 900 102 1620 1530
## 8 2013 1 8 901 900 1 1504 1530
## 9 2013 1 9 641 900 1301 1242 1530
## 10 2013 1 10 859 900 -1 1449 1530
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
select()

select() allows you to zoom in on a useful subset of columns using operations based on the names of the variables.
# select columns by names
dplyr::select(flights, year, month, day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# select all columns between year and day
dplyr::select(flights, year:day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
# select all columns except for those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 517 515 2 830 819 11 UA
## 2 533 529 4 850 830 20 UA
## 3 542 540 2 923 850 33 AA
## 4 544 545 -1 1004 1022 -18 B6
## 5 554 600 -6 812 837 -25 DL
## 6 554 558 -4 740 728 12 UA
## 7 555 600 -5 913 854 19 B6
## 8 557 600 -3 709 723 -14 EV
## 9 557 600 -3 838 846 -8 B6
## 10 558 600 -2 753 745 8 AA
## # ... with 336,766 more rows, and 9 more variables: flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
There are some other useful helper functions that you can use with select():
starts_with("abc") matches names that begin with "abc"
ends_with("xyz") matches names that end with "xyz"
contains("ijk") matches names that contain "ijk"
num_range("x", 1:3) matches x1, x2, and x3

See more in ?select.
select() can also be used to rename variables, but doing so is rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), a variant of select() that keeps all the variables that aren't explicitly mentioned.
rename(flights, date=day)
## # A tibble: 336,776 x 19
## year month date dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Another option is to use select() with everything(). This is useful if you have a handful of variables that you’d like to move to the start of the data frame.
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

# using select()
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
# using select() with helper function starts_with()
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ... with 336,766 more rows
# use select() with ends_with() and starts_with()
select(flights, ends_with("delay"), ends_with("time"), -starts_with(c("sched", "air")))
## # A tibble: 336,776 x 4
## dep_delay arr_delay dep_time arr_time
## <dbl> <dbl> <int> <int>
## 1 2 11 517 830
## 2 4 20 533 850
## 3 2 33 542 923
## 4 -1 -18 544 1004
## 5 -6 -25 554 812
## 6 -4 12 554 740
## 7 -5 19 555 913
## 8 -3 -14 557 709
## 9 -3 -8 557 838
## 10 -2 8 558 753
## # ... with 336,766 more rows
What happens if you include the name of a variable multiple times in a select() call?

select(flights, year, year)
## # A tibble: 336,776 x 1
## year
## <int>
## 1 2013
## 2 2013
## 3 2013
## 4 2013
## 5 2013
## 6 2013
## 7 2013
## 8 2013
## 9 2013
## 10 2013
## # ... with 336,766 more rows
If you include a variable multiple times in a select() call, it will only be selected once.
What does the any_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
The any_of() function selects columns using a character vector of names. Unlike all_of(), any_of() does not throw an error if the vector contains names that are not columns of the data frame.
select(flights, any_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ... with 336,766 more rows
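A small sketch of the difference (the name "not_a_column" is made up for illustration): any_of() silently skips unknown names, while all_of() raises an error:

```r
library(dplyr)
library(nycflights13)

vars2 <- c("year", "month", "not_a_column")

# any_of() ignores the unknown name and returns year and month
select(flights, any_of(vars2))

# all_of() would stop with an error about the unknown column:
# select(flights, all_of(vars2))
```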
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # ... with 336,766 more rows
The helpers ignore case by default. To change this behaviour, we can set the ignore.case argument to FALSE.
select(flights, contains("TIME", ignore.case=FALSE))
## # A tibble: 336,776 x 0
mutate()

We can add new columns that are functions of existing columns using mutate(). It always adds the new columns at the end of the dataset.
# create a dataframe with fewer columns
flights_sml <- select(flights, year:day, ends_with('delay'), distance, air_time)
# add new columns with mutate()
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 -9 370.
## 2 2013 1 1 4 20 1416 227 -16 374.
## 3 2013 1 1 2 33 1089 160 -31 408.
## 4 2013 1 1 -1 -18 1576 183 17 517.
## 5 2013 1 1 -6 -25 762 116 19 394.
## 6 2013 1 1 -4 12 719 150 -16 288.
## 7 2013 1 1 -5 19 1065 158 -24 404.
## 8 2013 1 1 -3 -14 229 53 11 259.
## 9 2013 1 1 -3 -8 944 140 5 405.
## 10 2013 1 1 -2 8 733 138 -10 319.
## # ... with 336,766 more rows
Note that you can refer to columns that you just created.
mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
## year month day dep_delay arr_delay distance air_time gain hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 -9 3.78
## 2 2013 1 1 4 20 1416 227 -16 3.78
## 3 2013 1 1 2 33 1089 160 -31 2.67
## 4 2013 1 1 -1 -18 1576 183 17 3.05
## 5 2013 1 1 -6 -25 762 116 19 1.93
## 6 2013 1 1 -4 12 719 150 -16 2.5
## 7 2013 1 1 -5 19 1065 158 -24 2.63
## 8 2013 1 1 -3 -14 229 53 11 0.883
## 9 2013 1 1 -3 -8 944 140 5 2.33
## 10 2013 1 1 -2 8 733 138 -10 2.3
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
If you only want to keep the new variables, use transmute().
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
## gain hours gain_per_hour
## <dbl> <dbl> <dbl>
## 1 -9 3.78 -2.38
## 2 -16 3.78 -4.23
## 3 -31 2.67 -11.6
## 4 17 3.05 5.57
## 5 19 1.93 9.83
## 6 -16 2.5 -6.4
## 7 -24 2.63 -9.11
## 8 11 0.883 12.5
## 9 5 2.33 2.14
## 10 -10 2.3 -4.35
## # ... with 336,766 more rows
There are many functions you might use with the mutate() function. Here are some useful ones.
Arithmetic operators: +, -, *, /, ^
Modular arithmetic: %/% (integer division) and %% (remainder)
Logarithms: log(), log2(), log10()
Offsets: lead() and lag() allow you to refer to leading or lagging values.

(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
# use this to compute running differences
x-lag(x)
## [1] NA 1 1 1 1 1 1 1 1 1
# find when value changes
x != lag(x)
## [1] NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Cumulative aggregates: cumsum() (cumulative sum), cumprod() (cumulative product), cummin() (cumulative min), cummax() (cumulative max), cummean() (cumulative mean).

x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cumprod(x)
## [1] 1 2 6 24 120 720 5040 40320 362880
## [10] 3628800
cummin(x)
## [1] 1 1 1 1 1 1 1 1 1 1
cummax(x)
## [1] 1 2 3 4 5 6 7 8 9 10
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
Logical comparisons: <, <=, >, >=, !=, ==
Ranking functions, e.g. min_rank():

y <- c(1, 2, 2, NA, 3, 4)
# rank in order
min_rank(y)
## [1] 1 2 2 NA 4 5
# rank in reverse order
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
Here are some variants of the standard ranks.
row_number(y)
## [1] 1 2 3 NA 4 5
dense_rank(y)
## [1] 1 2 2 NA 3 4
percent_rank(y)
## [1] 0.00 0.25 0.25 NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6 NA 0.8 1.0
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they're not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

# make a smaller data frame
flights_sml <- select(flights, dep_time, sched_dep_time)
# multiply hours by 60 and add to minutes
# time is formatted as HHMM or HMM
mutate(flights_sml,
new_dep_time = dep_time %/% 100 * 60 + dep_time %% 100,
new_sched_dep_time = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100
)
## # A tibble: 336,776 x 4
## dep_time sched_dep_time new_dep_time new_sched_dep_time
## <int> <int> <dbl> <dbl>
## 1 517 515 317 315
## 2 533 529 333 329
## 3 542 540 342 340
## 4 544 545 344 345
## 5 554 600 354 360
## 6 554 558 354 358
## 7 555 600 355 360
## 8 557 600 357 360
## 9 557 600 357 360
## 10 558 600 358 360
## # ... with 336,766 more rows
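One edge case: midnight departures are coded as 2400 in this dataset, which the formula above maps to 1440 rather than 0. A sketch of a conversion that wraps midnight back to 0 with a modulo:

```r
library(dplyr)
library(nycflights13)

# convert HHMM-coded times to minutes since midnight, mapping 2400 to 0
mutate(select(flights, dep_time, sched_dep_time),
       new_dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
       new_sched_dep_time = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440)
```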
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

I would expect to see the duration of the flight.
flights_sml <- select(flights, arr_time, dep_time, air_time)
mutate(flights_sml,
duration = arr_time - dep_time)
## # A tibble: 336,776 x 4
## arr_time dep_time air_time duration
## <int> <int> <dbl> <int>
## 1 830 517 227 313
## 2 850 533 227 317
## 3 923 542 160 381
## 4 1004 544 183 460
## 5 812 554 116 258
## 6 740 554 150 186
## 7 913 555 158 358
## 8 709 557 53 152
## 9 838 557 140 281
## 10 753 558 138 195
## # ... with 336,766 more rows
The computed duration doesn't match the air_time in the table. The subtraction is meaningless as written because the times are stored in HHMM format rather than as minutes, so we first need to convert them to minutes since midnight as we did previously. The times are also recorded in the local time zone, so we need to take time zones into account to get an accurate flight duration.
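A sketch of the first fix, converting both times to minutes since midnight and taking the difference modulo one day (this still won't equal air_time exactly, because of taxi time and time-zone differences):

```r
library(dplyr)
library(nycflights13)

flights %>%
  mutate(
    dep_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    arr_min = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
    # the modulo handles flights that land after midnight
    duration = (arr_min - dep_min) %% 1440
  ) %>%
  select(air_time, duration)
```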
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

select(flights, dep_time, sched_dep_time, dep_delay)
## # A tibble: 336,776 x 3
## dep_time sched_dep_time dep_delay
## <int> <int> <dbl>
## 1 517 515 2
## 2 533 529 4
## 3 542 540 2
## 4 544 545 -1
## 5 554 600 -6
## 6 554 558 -4
## 7 555 600 -5
## 8 557 600 -3
## 9 557 600 -3
## 10 558 600 -2
## # ... with 336,766 more rows
I would expect that dep_delay is the result of subtracting sched_dep_time from dep_time. We can add a column to verify this.
mutate(select(flights, dep_time, sched_dep_time, dep_delay),
difference = dep_time - sched_dep_time)
## # A tibble: 336,776 x 4
## dep_time sched_dep_time dep_delay difference
## <int> <int> <dbl> <int>
## 1 517 515 2 2
## 2 533 529 4 4
## 3 542 540 2 2
## 4 544 545 -1 -1
## 5 554 600 -6 -46
## 6 554 558 -4 -4
## 7 555 600 -5 -45
## 8 557 600 -3 -43
## 9 557 600 -3 -43
## 10 558 600 -2 -42
## # ... with 336,766 more rows
The naive subtraction only matches dep_delay when both times fall within the same hour. Across an hour boundary the HHMM arithmetic breaks down: for example, 554 - 600 gives -46, while the actual delay is -6 minutes. To verify the relationship, both times should first be converted to minutes since midnight.
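A sketch of the check after converting both times to minutes since midnight; any remaining mismatches should be flights whose actual departure slipped past midnight into the next day:

```r
library(dplyr)
library(nycflights13)

flights %>%
  mutate(
    dep_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    sched_min = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440,
    difference = dep_min - sched_min
  ) %>%
  # keep only the rows where the computed difference disagrees with dep_delay
  filter(difference != dep_delay) %>%
  select(dep_time, sched_dep_time, dep_delay, difference)
```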
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

# add a rank column (note: min_rank(dep_delay) ranks the smallest delays first,
# i.e. the flights that left earliest; for the most delayed flights you would
# rank with min_rank(desc(dep_delay)))
# filter by min rank column
# sort by the rank column
flights %>%
mutate(dep_rank = min_rank(dep_delay)) %>%
filter(dep_rank <= 10) %>%
arrange(dep_rank)
## # A tibble: 12 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## 11 2013 5 14 914 938 -24 1143 1204
## 12 2013 9 18 1631 1655 -24 1812 1845
## # ... with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_rank <int>
min_rank() does not break ties; instead, it assigns the same rank to equal values. We could instead use row_number(), which breaks ties by giving a lower rank to the value encountered first. We can try row_number() and see how the answer changes.
# add column for row_number()
# filter by row_number column
# sort by the row_number column
flights %>%
mutate(dep_rank = row_number(dep_delay)) %>%
filter(dep_rank <= 10) %>%
arrange(dep_rank)
## # A tibble: 10 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ... with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_rank <int>
Now we have exactly 10 flights since we broke ties with the row_number() function, whereas we had 12 flights using the min_rank() function. Depending on our needs, we might prefer one method over the other.
What does 1:3 + 1:10 return? Why?

# 1:3 + 1:10
Running this produces a warning rather than an error: "longer object length is not a multiple of shorter object length". R vectorizes the addition by recycling the shorter vector, but warns because 10 is not a multiple of 3, so the recycling is ragged. We can avoid the warning by making the longer vector's length a multiple of the shorter vector's length.
1:3 + 1:12
## [1] 2 4 6 5 7 9 8 10 12 11 13 15
Now the vector 1:3 is recycled four times and added elementwise across the twelve numbers of 1:12. Recycling the shorter vector up to the length of the longer one is the same mechanism that lets you add a scalar to a vector: the scalar is treated as a length-one vector and recycled.
You can access the trigonometric functions documentation by typing ?Trig. R provides the following trig functions:
cos(x), sin(x), tan(x)
acos(x), asin(x), atan(x), atan2(y, x)
cospi(x), sinpi(x), tanpi(x)

summarise()

The last key verb is summarise(). It collapses a data frame to a single row.
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
summarise() is more useful when paired with group_by(), which changes the unit of analysis from the entire data frame to individual groups. We can apply the same function to the grouped data frame to get the average delay per date.
group_by(flights, year, month, day) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
Not going to go through this. See R4ds book online.
The na.rm flag is used to remove NA values before the computation. Alternatively, we can filter the NA values out of the dataframe prior to summarizing.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.4
## 2 2013 1 2 13.7
## 3 2013 1 3 10.9
## 4 2013 1 4 8.97
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.56
## 9 2013 1 9 2.30
## 10 2013 1 10 2.84
## # ... with 355 more rows
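The two approaches agree because na.rm = TRUE and filtering out the NAs remove exactly the same observations before the mean is taken. A minimal base-R check (toy vector, not the flights data):

```r
x <- c(4, 8, NA, 12)

mean(x, na.rm = TRUE)  # drop NAs inside mean()
mean(x[!is.na(x)])     # drop NAs before calling mean(); same answer
```

Both expressions evaluate to 8.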
Whenever you do any aggregation, it’s always a good idea to include either a count (n()) or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes that have the highest average delays.
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(arr_delay)
)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
We get more information if we draw a scatterplot of number of flights vs. average delay.
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(data = delays, mapping = aes(x=n, y = delay)) +
geom_point(alpha = 1/10)
Now we see that there is much greater variation in the average delay when there are few flights. Typically, when you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.
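We can reproduce this shrinking variation with a small simulation (my own sketch, not from the book): the standard deviation of a group mean falls roughly as 1/sqrt(n), so means of small groups are much noisier than means of large ones.

```r
set.seed(1)
means_of_5   <- replicate(1000, mean(rnorm(5)))    # many small groups
means_of_100 <- replicate(1000, mean(rnorm(100)))  # many large groups

sd(means_of_5)    # near 1 / sqrt(5), about 0.45
sd(means_of_100)  # near 1 / sqrt(100), about 0.10
```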
When looking at this type of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows.
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
There is another common variation of this type of pattern. Let’s look at how the average performance of batters in baseball is related to the number of times they’re at bat.
library(Lahman)
# convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
group_by(playerID) %>%
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE), # batting average
ab = sum(AB, na.rm = TRUE) # at bat
)
batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x=ab, y=ba)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Note:

1. The variation in our aggregate decreases as we get more data points.
2. There’s a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and they obviously pick their best players.
It is sometimes useful to combine aggregation with logical subsetting.
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # ... with 355 more rows
Measures of spread are also useful.
sd(), IQR(), mad()
# why is the distance to some destinations more variable than others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
## # A tibble: 104 x 2
## dest distance_sd
## <chr> <dbl>
## 1 EGE 10.5
## 2 SAN 10.4
## 3 SFO 10.2
## 4 HNL 10.0
## 5 SEA 9.98
## 6 LAS 9.91
## 7 PDX 9.87
## 8 PHX 9.86
## 9 LAX 9.66
## 10 IND 9.46
## # ... with 94 more rows
Measures of rank:
min(), quantile(), max()
# when do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
Measures of position:
first(), nth(), last()
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last = last(dep_time)
)
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day first_dep last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row.
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r=min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
## # A tibble: 770 x 20
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 2356 2359 -3 425 437
## 3 2013 1 2 42 2359 43 518 442
## 4 2013 1 2 2354 2359 -5 413 437
## 5 2013 1 3 32 2359 33 504 442
## 6 2013 1 3 2349 2359 -10 434 445
## 7 2013 1 4 25 2359 26 505 442
## 8 2013 1 4 2358 2359 -1 429 437
## 9 2013 1 4 2358 2359 -1 436 445
## 10 2013 1 5 14 2359 15 503 445
## # ... with 760 more rows, and 12 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # r <int>
Counts:
sum(!is.na(x)), n_distinct(x)
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## # A tibble: 104 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ... with 94 more rows
dplyr provides a simple helper if all you want is a count.
not_cancelled %>%
count(dest)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
You could also use a weight variable. For example, you could use this to sum the total number of miles a plane flew.
not_cancelled %>%
count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
Counts and proportions of logical values:
# how many flights left before 5am?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 0
## 2 2013 1 2 3
## 3 2013 1 3 4
## 4 2013 1 4 3
## 5 2013 1 5 3
## 6 2013 1 6 2
## 7 2013 1 7 2
## 8 2013 1 8 1
## 9 2013 1 9 3
## 10 2013 1 10 3
## # ... with 355 more rows
# what proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_prop = mean(arr_delay > 60))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day hour_prop
## <int> <int> <int> <dbl>
## 1 2013 1 1 0.0722
## 2 2013 1 2 0.0851
## 3 2013 1 3 0.0567
## 4 2013 1 4 0.0396
## 5 2013 1 5 0.0349
## 6 2013 1 6 0.0470
## 7 2013 1 7 0.0333
## 8 2013 1 8 0.0213
## 9 2013 1 9 0.0202
## 10 2013 1 10 0.0183
## # ... with 355 more rows
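Both tricks above rely on logical vectors coercing to numeric: TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs and mean() gives their proportion. A quick base-R check (made-up delays, not the flights data):

```r
delays <- c(-5, 30, 90)  # hypothetical arrival delays in minutes

sum(delays > 60)   # number of delays over an hour
mean(delays > 60)  # proportion of delays over an hour
```

Here sum() returns 1 and mean() returns 1/3.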
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset.
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights=n()))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # ... with 355 more rows
(per_month <- summarise(per_day, flights=sum(flights)))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: year [1]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
## 7 2013 7 29425
## 8 2013 8 29327
## 9 2013 9 27574
## 10 2013 10 28889
## 11 2013 11 27268
## 12 2013 12 28135
(per_year <- summarise(per_month, flights = sum(flights)))
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
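One caveat worth remembering when rolling up like this: summing sums (and counts) is fine, but a mean of group means weights every group equally regardless of its size, so it generally differs from the overall mean. A small base-R illustration with made-up numbers:

```r
small <- c(0, 0)            # group of 2
large <- c(10, 10, 10, 10)  # group of 4

mean(c(small, large))              # overall mean: 40 / 6, about 6.67
mean(c(mean(small), mean(large)))  # mean of group means: 5, not the same
```

So rolling up a per-day mean delay to a per-month mean by averaging the daily values would silently give every day equal weight, regardless of how many flights it had.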
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily %>%
ungroup() %>%
summarise(flights = n())
## # A tibble: 1 x 1
## flights
## <int>
## 1 336776
not_cancelled %>%
group_by(flight) %>%
summarise(
early_prop = mean(arr_delay < 0),
late_prop = mean(arr_delay > 0)
) %>%
filter(
early_prop == 0.5,
late_prop == 0.5
)
## # A tibble: 147 x 3
## flight early_prop late_prop
## <int> <dbl> <dbl>
## 1 64 0.5 0.5
## 2 69 0.5 0.5
## 3 100 0.5 0.5
## 4 107 0.5 0.5
## 5 110 0.5 0.5
## 6 113 0.5 0.5
## 7 151 0.5 0.5
## 8 154 0.5 0.5
## 9 194 0.5 0.5
## 10 278 0.5 0.5
## # ... with 137 more rows
not_cancelled %>%
group_by(flight) %>%
summarise(
late_prop = mean(arr_delay > 0)
) %>%
filter(
late_prop == 1
)
## # A tibble: 147 x 2
## flight late_prop
## <int> <dbl>
## 1 94 1
## 2 730 1
## 3 896 1
## 4 919 1
## 5 955 1
## 6 974 1
## 7 990 1
## 8 1015 1
## 9 1084 1
## 10 1086 1
## # ... with 137 more rows
not_cancelled %>%
group_by(flight) %>%
summarise(
early_30 = mean(arr_delay == -30),
late_30 = mean(arr_delay == 30)
) %>%
filter(
early_30 == 0.5,
late_30 == 0.5
)
## # A tibble: 0 x 3
## # ... with 3 variables: flight <int>, early_30 <dbl>, late_30 <dbl>
not_cancelled %>%
group_by(flight) %>%
summarise(
early_prop = mean(arr_delay == 0),
late_prop = mean(arr_delay == 2 * 60)
) %>%
filter(
early_prop == 0.99,
late_prop == 0.01
)
## # A tibble: 0 x 3
## # ... with 3 variables: flight <int>, early_prop <dbl>, late_prop <dbl>
Which is more important: arrival delay or departure delay?
We use arrival delay to compute all of these.
Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
# match the first command using n() with summarise()
counts <- not_cancelled %>% count(dest)
summarise_counts <- not_cancelled %>%
group_by(dest) %>%
summarise(summarise_n = n())
full_join(counts, summarise_counts)
## Joining, by = "dest"
## # A tibble: 104 x 3
## dest n summarise_n
## <chr> <int> <int>
## 1 ABQ 254 254
## 2 ACK 264 264
## 3 ALB 418 418
## 4 ANC 8 8
## 5 ATL 16837 16837
## 6 AUS 2411 2411
## 7 AVL 261 261
## 8 BDL 412 412
## 9 BGR 358 358
## 10 BHM 269 269
## # ... with 94 more rows
# match the second command using sum() and summarise()
sums <- not_cancelled %>% count(tailnum, wt=distance)
summarise_sums <- not_cancelled %>%
group_by(tailnum) %>%
summarise(summarise_n = sum(distance))
full_join(sums, summarise_sums)
## Joining, by = "tailnum"
## # A tibble: 4,037 x 3
## tailnum n summarise_n
## <chr> <dbl> <dbl>
## 1 D942DN 3418 3418
## 2 N0EGMQ 239143 239143
## 3 N10156 109664 109664
## 4 N102UW 25722 25722
## 5 N103US 24619 24619
## 6 N104UW 24616 24616
## 7 N10575 139903 139903
## 8 N105UW 23618 23618
## 9 N107US 21677 21677
## 10 N108UW 32070 32070
## # ... with 4,027 more rows
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?
arr_delay is the most important column because there seem to be flights with dep_delay entries even though arr_delay is NA.
# arr_delay is NA but dep_delay is not NA
filter(flights,
is.na(arr_delay),
!is.na(dep_delay))
## # A tibble: 1,175 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 1525 1530 -5 1934 1805
## 2 2013 1 1 1528 1459 29 2002 1647
## 3 2013 1 1 1740 1745 -5 2158 2020
## 4 2013 1 1 1807 1738 29 2251 2103
## 5 2013 1 1 1939 1840 59 29 2151
## 6 2013 1 1 1952 1930 22 2358 2207
## 7 2013 1 1 2016 1930 46 NA 2220
## 8 2013 1 2 905 822 43 1313 1045
## 9 2013 1 2 1125 925 120 1445 1146
## 10 2013 1 2 1848 1840 8 2333 2151
## # ... with 1,165 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# dep_delay is NA but arr_delay is not NA - no flights here
filter(flights,
is.na(dep_delay),
!is.na(arr_delay))
## # A tibble: 0 x 19
## # ... with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
## # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
group_by(year, month, day) %>%
summarise(
prop_cancelled = mean(is.na(arr_delay)), # proportion cancelled
avg_delay = mean(dep_delay, na.rm = TRUE)
) %>%
ggplot(aes(x=prop_cancelled, y=avg_delay)) +
geom_point()
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
There seems to be a slight positive relationship between avg_delay and prop_cancelled; however, there are some outlier days.
(Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
We have to decide whether departure or arrival delays are more important. Let’s check the relationship between the two.
ggplot(flights) +
geom_count(aes(x=arr_delay,
y=dep_delay,
size=..n..,
color=..n..)) +
guides(color = 'legend') +
geom_abline(slope=1, color='red')
## Warning: Removed 9430 rows containing non-finite values (stat_sum).
The two types of delay are broadly similar, with no obvious pattern of one dominating the other. Still, there appear to be slightly more points under the red line, indicating that arrival delays tend to be shorter than departure delays. This makes sense because time can be made up during a smooth flight. It’s probably more important to people that they arrive at their destination on time, so we’ll go with arrival delay.
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(carrier) %>%
summarise(max_delay = max(dep_delay)) %>%
arrange(desc(max_delay))
## # A tibble: 16 x 2
## carrier max_delay
## <chr> <dbl>
## 1 HA 1301
## 2 MQ 1137
## 3 AA 1014
## 4 DL 960
## 5 F9 853
## 6 9E 747
## 7 VX 653
## 8 FL 602
## 9 EV 548
## 10 B6 502
## 11 US 500
## 12 UA 483
## 13 WN 471
## 14 YV 387
## 15 AS 225
## 16 OO 154
We remove the flights with NA values for arrival delay since we can’t do anything with those. We see that carriers with the worst delays are HA, MQ, and AA.
Now we’ll try disentangling the effects of bad airports vs. bad carriers. We’ll group by the carrier and origin airport to find the max delays between these pairs. Then we can pass the results to ggplot so we can easily visualize them as a heatmap. We’re still seeing the HA and MQ carriers stand out, and JFK airport seems to have a lot of arrival delays. We could follow the same procedure for destination airports as well.
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(carrier, origin) %>%
summarise(max_delay = max(dep_delay)) %>%
arrange(desc(max_delay)) %>%
ggplot() +
geom_tile(aes(x=origin,
y=carrier,
fill=max_delay))
## `summarise()` has grouped output by 'carrier'. You can override using the `.groups` argument.
What does the sort argument to count() do? When might you use it?
The count() function counts the unique values of one or more variables, and the sort argument displays the largest groups at the top. You might use it whenever you want the most common groups first without a separate arrange() call.
flights %>%
count(carrier, sort=TRUE)
## # A tibble: 16 x 2
## carrier n
## <chr> <int>
## 1 UA 58665
## 2 B6 54635
## 3 EV 54173
## 4 DL 48110
## 5 AA 32729
## 6 MQ 26397
## 7 US 20536
## 8 9E 18460
## 9 WN 12275
## 10 VX 5162
## 11 FL 3260
## 12 AS 714
## 13 F9 685
## 14 YV 601
## 15 HA 342
## 16 OO 32
Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter().
Find the worst members of each group:
flights %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 19
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 848 1835 853 1001 1950
## 2 2013 1 1 1815 1325 290 2120 1542
## 3 2013 1 1 1842 1422 260 1958 1535
## 4 2013 1 1 1942 1705 157 2124 1830
## 5 2013 1 1 2006 1630 216 2230 1848
## 6 2013 1 1 2115 1700 255 2330 1920
## 7 2013 1 1 2205 1720 285 46 2040
## 8 2013 1 1 2312 2000 192 21 2110
## 9 2013 1 1 2343 1724 379 314 1938
## 10 2013 1 2 1244 900 224 1431 1104
## # ... with 3,296 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Find all groups bigger than a threshold
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
## # A tibble: 332,577 x 19
## # Groups: dest [77]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 332,567 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Standardise to compute per group metrics
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups: dest [77]
## year month day dest arr_delay prop_delay
## <int> <int> <int> <chr> <dbl> <dbl>
## 1 2013 1 1 IAH 11 0.000111
## 2 2013 1 1 IAH 20 0.000201
## 3 2013 1 1 MIA 33 0.000235
## 4 2013 1 1 ORD 12 0.0000424
## 5 2013 1 1 FLL 19 0.0000938
## 6 2013 1 1 ORD 8 0.0000283
## 7 2013 1 1 LAX 7 0.0000344
## 8 2013 1 1 DFW 31 0.000282
## 9 2013 1 1 ATL 12 0.0000400
## 10 2013 1 1 DTW 16 0.000116
## # ... with 131,096 more rows
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum) has the worst on-time record?
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(tailnum) %>%
mutate(max_delay=max(arr_delay)) %>%
arrange(desc(max_delay)) %>%
select(tailnum, max_delay) %>%
unique()
## # A tibble: 4,037 x 2
## # Groups: tailnum [4,037]
## tailnum max_delay
## <chr> <dbl>
## 1 N384HA 1272
## 2 N504MQ 1127
## 3 N517MQ 1109
## 4 N338AA 1007
## 5 N665MQ 989
## 6 N959DL 931
## 7 N927DA 915
## 8 N6716C 895
## 9 N5DMAA 878
## 10 N523MQ 875
## # ... with 4,027 more rows
The worst on-time record is held by plane N384HA, with a 1272-minute arrival delay.
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(sched_dep_time) %>%
summarise(avg_delay = mean(arr_delay)) %>%
arrange(desc(avg_delay))
## # A tibble: 1,020 x 2
## sched_dep_time avg_delay
## <int> <dbl>
## 1 2207 105.
## 2 1848 69
## 3 1752 68.2
## 4 1531 65.1
## 5 2339 59
## 6 1739 56.1
## 7 1653 55.4
## 8 1747 54.6
## 9 558 54.3
## 10 2103 52.4
## # ... with 1,010 more rows
The worst average delay occurs with flights that are scheduled to depart at 2207.
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(tailnum) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(tailnum, arr_delay, prop_delay)
## # A tibble: 327,346 x 3
## # Groups: tailnum [4,037]
## tailnum arr_delay prop_delay
## <chr> <dbl> <dbl>
## 1 N14228 11 0.0267
## 2 N24211 20 0.0200
## 3 N619AA 33 0.188
## 4 N804JB -18 0.045
## 5 N668DN -25 -0.198
## 6 N39463 12 0.0519
## 7 N516JB 19 0.00556
## 8 N829AS -14 -0.00379
## 9 N593JB -8 -0.00217
## 10 N3ALAA 8 0.0354
## # ... with 327,336 more rows
Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
flights %>%
filter(!is.na(arr_delay)) %>%
group_by(origin) %>%
arrange(sched_dep_time) %>%
mutate(prev_delay = lag(arr_delay)) %>%
select(flight, arr_delay, prev_delay) %>%
ggplot() +
geom_point(aes(x=prev_delay,
y=arr_delay))
## Adding missing grouping variables: `origin`
## Warning: Removed 3 rows containing missing values (geom_point).
The y-axis shows the arrival delay for a particular flight, and the x-axis shows the arrival delay for the flight that was scheduled to depart just before it from the same origin airport. The plot doesn’t show a strong relationship between the arrival delay of a flight and that of the flight departing immediately before it.
flight_speed <- flights %>%
select(flight, tailnum, origin, dest, air_time, distance) %>%
mutate(speed = distance / (air_time/60)) %>%
arrange(desc(speed))
flight_speed
## # A tibble: 336,776 x 7
## flight tailnum origin dest air_time distance speed
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1499 N666DN LGA ATL 65 762 703.
## 2 4667 N17196 EWR MSP 93 1008 650.
## 3 4292 N14568 EWR GSP 55 594 648
## 4 3805 N12567 EWR BNA 70 748 641.
## 5 1902 N956DL LGA PBI 105 1035 591.
## 6 315 N3768 JFK SJU 170 1598 564
## 7 707 N779JB JFK SJU 172 1598 557.
## 8 936 N5FFAA JFK STT 175 1623 556.
## 9 347 N3773D JFK SJU 173 1598 554.
## 10 1503 N571JB JFK SJU 173 1598 554.
## # ... with 336,766 more rows
We compute the speed of each flight by taking the distance traveled (distance) divided by the time spent in the air (air_time). The air time is in minutes, so we divide by 60 to get speed in miles per hour. We can look at the distribution of flight speeds to find outliers.
summary(flight_speed$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 76.8 358.1 404.2 394.3 438.8 703.4 9430
It looks like the median flight speed is about 404 miles per hour, and 75% of flights are flying under 438 miles per hour. This suggests that our fastest flights at about 600-700 miles per hour are quite fast compared to average flights.
We can now look at each flight’s air time in a destination-specific manner.
flights %>%
group_by(origin, dest) %>%
filter(!is.na(air_time)) %>%
mutate(time_diff = air_time - min(air_time)) %>%
arrange(desc(time_diff)) %>%
select(carrier:air_time, time_diff) %>%
group_by(origin, dest) %>%
slice(1) %>%
arrange(desc(time_diff))
## # A tibble: 223 x 7
## # Groups: origin, dest [223]
## carrier flight tailnum origin dest air_time time_diff
## <chr> <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 DL 841 N703TW JFK SFO 490 189
## 2 DL 426 N178DN JFK LAX 440 165
## 3 AA 575 N5DBAA JFK EGE 382 163
## 4 UA 745 N578UA LGA DEN 331 145
## 5 UA 587 N852UA EWR LAS 399 143
## 6 UA 15 N77066 EWR HNL 695 133
## 7 B6 89 N794JB JFK SAN 413 132
## 8 UA 1075 N16709 EWR SNA 405 131
## 9 UA 1178 N18243 EWR AUS 301 127
## 10 B6 1295 N593JB JFK AUS 301 126
## # ... with 213 more rows
We group by the origin and destination so we can compare similar flights. Then compute the difference between air time and the minimum air time for flights on the same route. Then we can sort and take the flight with the largest time difference. Here we see that the largest time difference occurs for flight 841 from JFK to SFO.
delayed <- flights %>%
group_by(tailnum, year, month, day) %>%
arrange(dep_time) %>%
summarise(num_prior_delay = sum(arr_delay < 60),
num_delayed = sum(arr_delay > 60),
prop_prior_delayed = num_prior_delay / (num_delayed + num_prior_delay)) %>%
select(tailnum, year, month, day, num_prior_delay, num_delayed, prop_prior_delayed)
## `summarise()` has grouped output by 'tailnum', 'year', 'month'. You can override using the `.groups` argument.
delayed
## # A tibble: 251,727 x 7
## # Groups: tailnum, year, month [37,988]
## tailnum year month day num_prior_delay num_delayed prop_prior_delayed
## <chr> <int> <int> <int> <int> <int> <dbl>
## 1 D942DN 2013 2 11 0 1 0
## 2 D942DN 2013 3 23 1 0 1
## 3 D942DN 2013 3 24 1 0 1
## 4 D942DN 2013 7 5 1 0 1
## 5 N0EGMQ 2013 1 1 1 1 0.5
## 6 N0EGMQ 2013 1 2 2 0 1
## 7 N0EGMQ 2013 1 4 1 0 1
## 8 N0EGMQ 2013 1 5 1 0 1
## 9 N0EGMQ 2013 1 6 2 0 1
## 10 N0EGMQ 2013 1 7 3 0 1
## # ... with 251,717 more rows
These show, for each plane and day, the number of flights delayed by less than an hour (num_prior_delay), the number delayed by more than an hour (num_delayed), and the proportion of the day’s flights that came before the first one-hour delay. This code assumes that once a plane is delayed by an hour on a given day, it never returns to schedule. We can look at a summary of these numbers.
summary(delayed)
## tailnum year month day
## Length:251727 Min. :2013 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Mode :character Median :2013 Median : 7.000 Median :16.00
## Mean :2013 Mean : 6.565 Mean :15.73
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :2013 Max. :12.000 Max. :31.00
##
## num_prior_delay num_delayed prop_prior_delayed
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:1.000
## Median :1.000 Median :0.000 Median :1.000
## Mean :1.205 Mean :0.108 Mean :0.919
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:1.000
## Max. :5.000 Max. :4.000 Max. :1.000
## NA's :6661 NA's :6661 NA's :6933
It looks like there is typically 1 flight before a plane is delayed by at least one hour, with a maximum of 5 flights prior to the first one-hour delay. However, it seems that most flights are not delayed by more than one hour on a given day.
Not much going on in this chapter. Just a basic overview of writing scripts and using RStudio. Here’s the link: https://r4ds.had.co.nz/workflow-scripts.html
Two important questions to guide your data exploration:
Variation is the tendency of the values of a variable to change from measurement to measurement. The best way to understand a variable’s pattern of variation is to visualise the distribution of its values.
How you visualise the distribution of a variable depends on whether the variable is categorical or continuous. To examine the distribution of a categorical variable, use a bar chart:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
The height of the bars displays how many observations occurred with each x value. You can also compute this number manually using dplyr::count().
diamonds %>%
count(cut)
## # A tibble: 5 x 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
To examine the distribution of a continuous variable, use a histogram:
ggplot(data=diamonds) +
geom_histogram(mapping = aes(x=carat), binwidth=0.5)
You can compute this by hand by combining dplyr::count() and ggplot2::cut_width().
diamonds %>%
count(cut_width(carat, 0.5))
## # A tibble: 11 x 2
## `cut_width(carat, 0.5)` n
## <fct> <int>
## 1 [-0.25,0.25] 785
## 2 (0.25,0.75] 29498
## 3 (0.75,1.25] 15977
## 4 (1.25,1.75] 5313
## 5 (1.75,2.25] 2002
## 6 (2.25,2.75] 322
## 7 (2.75,3.25] 32
## 8 (3.25,3.75] 5
## 9 (3.75,4.25] 4
## 10 (4.25,4.75] 1
## 11 (4.75,5.25] 1
You should always play with different binwidths when making histograms, as different binwidths can reveal different patterns.
# take only the diamonds with less than 3 carats
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data=smaller, mapping=aes(x=carat)) +
geom_histogram(binwidth = 0.1)
If you wish to overlay multiple histograms in the same plot, use geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than bars.
ggplot(data = smaller, mapping = aes(x=carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
Asking questions about the frequency of values is useful in data exploration. You can look for anything unexpected with questions like:
As an example, the histogram below suggests several interesting questions:
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
Many of the questions above will prompt you to explore a relationship between variables.
Sometimes outliers are difficult to spot on histograms. Take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
There are so many observations in the common bins that the rare bins are so short you can’t see them. To make things easier to see, we can zoom into the small values of the y-axis using coord_cartesian().
ggplot(diamonds) +
geom_histogram(mapping = aes(x=y), binwidth=0.5) +
coord_cartesian(ylim = c(0, 50))
coord_cartesian() also has an xlim argument to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.
ggplot(diamonds) +
geom_histogram(mapping = aes(x=y),
binwidth=0.5) +
ylim(0, 50)
## Warning: Removed 11 rows containing missing values (geom_bar).
You can see that the data has been removed when using the ggplot2 ylim function.
There appear to be three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
## # A tibble: 9 x 4
## price x y z
## <int> <dbl> <dbl> <dbl>
## 1 5139 0 0 0
## 2 6381 0 0 0
## 3 12800 0 0 0
## 4 15686 0 0 0
## 5 18034 0 0 0
## 6 2130 0 0 0
## 7 2130 0 0 0
## 8 2075 5.15 31.8 5.12
## 9 12210 8.09 58.9 8.06
The y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. We might also suspect that the measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!
It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
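Not in book: a quick way to check how much the outliers matter is to compute the same summary with and without them (a sketch, reusing the y < 3 | y > 20 rule from above):

```r
# summary statistics with the outliers included
diamonds %>%
  summarise(mean_y = mean(y), sd_y = sd(y))

# the same summary with the unusual y values excluded
diamonds %>%
  filter(!(y < 3 | y > 20)) %>%
  summarise(mean_y = mean(y), sd_y = sd(y))
```

If the two summaries are essentially the same, the outliers have minimal effect here.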
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
diamonds %>%
gather(key='dimension', value='value', x:z) %>%
ggplot() +
geom_histogram(aes(x=value,
fill=dimension),
binwidth = 0.5) +
facet_wrap(. ~ dimension, ncol=1) +
coord_cartesian(xlim=c(1,10))
We see that the z dimension tends to be smaller than the x or y dimensions. The x and y dimensions are similarly distributed. We typically see diamonds cut with circular profiles and varying height. We might therefore guess that the width and length are along the x and y axes, while the height is along the z-axis.
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(diamonds) +
geom_histogram(aes(x=price), binwidth = 100) +
coord_cartesian(xlim=c(0,2000))
There is an interesting gap in the distribution right at 1500 that is visible with binwidth 100.
ggplot(diamonds) +
geom_histogram(aes(x=carat)) +
coord_cartesian(xlim=c(0.99, 1.01))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
At this binwidth there is no visible difference between diamonds that are 0.99 carat and 1 carat; the two values fall into the same bin, so a smaller binwidth would be needed to distinguish these categories.
What happens with coord_cartesian() vs xlim() or ylim() when zooming in on a histogram? What happens if you leave binwidth unset? What happens if you try to zoom so only half a bar shows?
coord_cartesian() will not cut out data if it does not fit in the plot window, whereas xlim() and ylim() will cut out data if its entire plotting region is not shown in the window. If binwidth is left unset, ggplot2 defaults to 30 bins; with xlim() or ylim() those bins are computed after the out-of-range data has been dropped, so the binning itself changes. If you zoom so only half a bar shows, coord_cartesian() simply clips the bar, while xlim() removes the underlying observations so the partial bar disappears.
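Not in book: a sketch demonstrating the difference when a zoom cuts through the bars:

```r
# coord_cartesian() only clips the drawing: the bars are computed from
# the full data, and a bar straddling the limit is shown partially
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(xlim = c(3, 7))

# xlim() drops observations outside the limits before binning, so the
# counts (and the bins themselves) change, with a removed-rows warning
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  xlim(3, 7)
```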
Instead of dropping unusual values, you might want to replace them with missing (NA) values. You can use the ifelse() function to do this.
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
There are three arguments to ifelse(): the first argument, test, should be a logical vector. The result contains the value of the second argument, yes, where test is TRUE, and the value of the third argument, no, where it is FALSE.
Alternatively, you can use dplyr::case_when(), which is particularly useful inside mutate() when you want to create a new variable that relies on a complex combination of existing variables.
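Not in book: a sketch of the same replacement written with case_when(). Conditions are checked in order, and the first TRUE formula supplies the value; NA_real_ is used so the NA matches the numeric type of y:

```r
diamonds2 <- diamonds %>%
  mutate(y = case_when(
    y < 3 | y > 20 ~ NA_real_,  # unusual widths become missing
    TRUE ~ y                    # everything else is kept as-is
  ))
```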
Other times, you might want to understand what makes observations with missing values different from observations with recorded values. For example, in nycflights13::flights, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights.
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
sum(is.na(diamonds2$y))
## [1] 9
ggplot(diamonds2) +
geom_histogram(aes(x=y), fill='chocolate1', alpha=0.7) +
geom_bar(aes(x=y), fill='cornflowerblue', alpha=0.7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 9 rows containing non-finite values (stat_count).
In both cases, the missing values are removed with a warning.
What does na.rm = TRUE do in mean() and sum()?
mean(diamonds2$y)
## [1] NA
mean(diamonds2$y, na.rm = TRUE)
## [1] 5.733801
sum(diamonds2$y)
## [1] NA
sum(diamonds2$y, na.rm = TRUE)
## [1] 309229.6
With both mean and sum, if there are missing values, the output will be NA. When we use na.rm = TRUE, the missing values are removed and we can get a numeric answer returned.
If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how the price of a diamond varies with its quality:
ggplot(data=diamonds, mapping=aes(x=price)) +
geom_freqpoly(mapping=aes(colour=cut), binwidth=500)
It’s hard to see the difference in distributions because the overall counts differ so much:
ggplot(diamonds) +
geom_bar(mapping = aes(x=cut))
To make the comparison easier, we can swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.
ggplot(diamonds, mapping=aes(x=price, y=..density..)) +
geom_freqpoly(mapping=aes(colour=cut), binwidth=500)
Another alternative for displaying the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians.
Let’s take a look at the distribution of price by cut using geom_boxplot()
ggplot(diamonds, aes(x=cut, y=price)) +
geom_boxplot()
cut is an ordered factor: fair is worse than good, which is worse than very good, and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the reorder() function.
For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:
ggplot(mpg, mapping=aes(x=class, y=hwy)) +
geom_boxplot()
To make the trend easier to see, we can reorder class based on the median value of hwy:
ggplot(data=mpg) +
geom_boxplot(aes(x=reorder(class, hwy, FUN=median), y=hwy))
If you have long variable names, geom_boxplot() will work better for you if you flip it 90 degrees. You can do that with coord_flip().
ggplot(mpg) +
geom_boxplot(aes(x=reorder(class, hwy, FUN=median), y=hwy)) +
coord_flip()
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(x=cancelled, y=sched_dep_time)) +
geom_boxplot(aes(color=cancelled))
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# separate categorical features
new_diamonds <- diamonds %>%
gather(key=feature, value=value, -price) %>%
mutate(feature_type = ifelse(feature %in% c('cut', 'color', 'clarity'),
'cat', 'cont'))
## Warning: attributes are not identical across measure variables;
## they will be dropped
cats <- new_diamonds %>%
filter(feature_type == 'cat')
cont <- new_diamonds %>%
filter(feature_type == 'cont')
# plot the continuous features against price to visualize relationships
ggplot(cont) +
geom_point(aes(x=value, y=price)) +
facet_wrap(. ~ feature, scales='free_x')
# plot categorical features
ggplot(cats) +
geom_boxplot(aes(x=value, y=price)) +
facet_wrap(. ~ feature, scales='free')
summary(lm(price ~ ., data=diamonds))
##
## Call:
## lm(formula = price ~ ., data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21376.0 -592.4 -183.5 376.4 10694.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5753.762 396.630 14.507 < 2e-16 ***
## carat 11256.978 48.628 231.494 < 2e-16 ***
## cut.L 584.457 22.478 26.001 < 2e-16 ***
## cut.Q -301.908 17.994 -16.778 < 2e-16 ***
## cut.C 148.035 15.483 9.561 < 2e-16 ***
## cut^4 -20.794 12.377 -1.680 0.09294 .
## color.L -1952.160 17.342 -112.570 < 2e-16 ***
## color.Q -672.054 15.777 -42.597 < 2e-16 ***
## color.C -165.283 14.725 -11.225 < 2e-16 ***
## color^4 38.195 13.527 2.824 0.00475 **
## color^5 -95.793 12.776 -7.498 6.59e-14 ***
## color^6 -48.466 11.614 -4.173 3.01e-05 ***
## clarity.L 4097.431 30.259 135.414 < 2e-16 ***
## clarity.Q -1925.004 28.227 -68.197 < 2e-16 ***
## clarity.C 982.205 24.152 40.668 < 2e-16 ***
## clarity^4 -364.918 19.285 -18.922 < 2e-16 ***
## clarity^5 233.563 15.752 14.828 < 2e-16 ***
## clarity^6 6.883 13.715 0.502 0.61575
## clarity^7 90.640 12.103 7.489 7.06e-14 ***
## depth -63.806 4.535 -14.071 < 2e-16 ***
## table -26.474 2.912 -9.092 < 2e-16 ***
## x -1008.261 32.898 -30.648 < 2e-16 ***
## y 9.609 19.333 0.497 0.61918
## z -50.119 33.486 -1.497 0.13448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
Based on the plots and the lm output, carat also appears to have a huge effect on the price of a diamond.
ggplot(diamonds) +
geom_boxplot(aes(x=cut, y=carat))
We see that the carat value tends to be higher for the Fair cut diamonds. This combination of high carats with Fair cuts drives prices higher for lower quality cuts.
Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
# use ggstance
ggplot(mpg, aes(hwy, class, fill=factor(cyl))) +
geom_boxploth()
# Horizontal with coord_flip()
ggplot(mpg, aes(class, hwy, fill = factor(cyl))) +
geom_boxplot() +
coord_flip()
Both methods produce the same plots, but the syntax used to create them differs. The ggstance syntax is a bit more intuitive, but it doesn’t make a huge difference.
Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
ggplot(diamonds) +
geom_lv(aes(x=cut, y=price, fill=cut))
ggplot(diamonds) +
geom_boxplot(aes(x=cut, y=price, fill=cut))
These letter-value plots give us more information beyond the quantiles of the boxplot. The middle box represents the same information as the boxplot, and each subsequent box that branches off the middle encompasses 50% of the remaining data. The boxes stretch further for the Very Good and Premium cuts, meaning that more people are paying higher prices for those cuts. When a box is short, few people are paying much more than the values covered by the previous box. These plots give us more information about the distribution of extreme values, so they make sense for larger datasets, where outliers will be more numerous. One misleading feature is that the width of the boxes does not represent anything. It seems like violin plots are more intuitive than these letter-value plots.
Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
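Not in book: a sketch of the two alternatives for price by cut, to set against the coloured geom_freqpoly() shown earlier. The violin plot shows each density as a compact mirrored shape on one panel, while the facetted histograms keep raw counts but take more space:

```r
# one panel, densities mirrored vertically per category
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin()

# one histogram per category; free y scales compensate for group sizes
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~ cut, ncol = 1, scales = "free_y")
```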
If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
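Not in book: assuming the ggbeeswarm package is installed, its geoms can be used in place of geom_jitter(). geom_quasirandom() offsets points with a quasirandom pattern so the spread reflects the local density, and geom_beeswarm() packs points side by side so that none overlap:

```r
library(ggbeeswarm)

# jitter-like, but the horizontal offset follows the density
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_quasirandom()

# points arranged side by side to avoid overlap entirely
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_beeswarm()
```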
To visualize covariation between categorical variables, you’ll need to count the number of observations for each combination. One way to do this is to use geom_count():
ggplot(data = diamonds) +
geom_count(aes(x=cut, y=color))
The size of each circle represents the count at each categorical combination. Covariation will appear as strong correlation between specific x values and specific y values.
Another approach is to compute the count with dplyr:
diamonds %>%
count(color, cut)
## # A tibble: 35 x 3
## color cut n
## <ord> <ord> <int>
## 1 D Fair 163
## 2 D Good 662
## 3 D Very Good 1513
## 4 D Premium 1603
## 5 D Ideal 2834
## 6 E Fair 224
## 7 E Good 933
## 8 E Very Good 2400
## 9 E Premium 2337
## 10 E Ideal 3903
## # ... with 25 more rows
Then visualize with geom_tile() and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x=color, y=cut)) +
geom_tile(mapping = aes(fill=n))
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
diamonds %>%
count(color, cut) %>%
mutate(log_n = log10(n)) %>%
ggplot(mapping = aes(x=color, y=cut)) +
geom_tile(mapping = aes(fill=log_n))
We can take log values to get a better idea of how the counts are distributed.
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
group_by(dest, month) %>%
filter(!is.na(dep_delay)) %>%
summarise(delays=mean(dep_delay)) %>%
ggplot(aes(x=month, y=dest)) +
geom_tile(aes(fill=delays))
## `summarise()` has grouped output by 'dest'. You can override using the `.groups` argument.
There were a lot of NA values that made the plot very difficult to read. After removing them, we have a cleaner plot, though there are still so many destinations that it is crowded. If we filtered the destinations down to those we care about, or grouped them, we’d have a cleaner plot. The months would also make more sense as factors rather than numbers.
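Not in book: a sketch of those improvements, converting month to a factor and keeping only destinations that have flights in every month:

```r
flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(dest, month) %>%
  summarise(delays = mean(dep_delay)) %>%
  group_by(dest) %>%
  filter(n() == 12) %>%          # drop destinations not served all year
  ggplot(aes(x = factor(month), y = dest)) +
  geom_tile(aes(fill = delays))
```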
Why is it slightly better to use aes(x=color, y=cut) rather than aes(x=cut, y=color) in the example above?
ggplot(data = diamonds) +
geom_count(aes(x=cut, y=color))
ggplot(data = diamonds) +
geom_count(aes(x=color, y=cut))
One great way to visualize the covariation between two continuous variables is with a scatterplot. For example, we can easily see an exponential relationship between the carat size and price of a diamond.
ggplot(diamonds) +
geom_point(aes(x=carat, y=price))
However, scatterplots become less useful as the size of your dataset grows, because points begin to overplot and pile up into areas of uniform black (as above). You can add the alpha aesthetic to add transparency to help with this.
ggplot(diamonds) +
geom_point(aes(x=carat, y=price), alpha = 1/100)
Using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.
geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You need to install the hexbin package to use geom_hex().
ggplot(smaller) +
geom_bin2d(aes(x=carat, y=price))
# install.packages('hexbin')
ggplot(smaller) +
geom_hex(aes(x=carat, y=price))
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a categorical and a continuous variable that you learned about earlier. For example, you could bin carat and then, for each group, display a boxplot:
ggplot(smaller, aes(x=carat, y=price)) +
geom_boxplot(aes(group=cut_width(carat, 0.1)))
cut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth=TRUE.
ggplot(smaller, aes(x=carat, y=price)) +
geom_boxplot(aes(group=cut_width(carat, 0.1)), varwidth=TRUE)
Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number():
ggplot(smaller, aes(x=carat, y=price)) +
geom_boxplot(aes(group=cut_number(carat, 20)))
What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
cut_width() will give you even bin sizes, but it won’t take into account the number of points within each of those bins. You could use cut_width() in combination with varwidth=TRUE to give an idea of how many points are within each bin. cut_number() will give you bins with the same number of points in them, but the sizes of the bins will differ from one another. If the bins hold unequal numbers of points, it is more informative to convey that in the plot. We can see in the plots above how these two options look: cut_number() offers a much more apparent reflection of the different numbers of points contained within each region.
ggplot(smaller, aes(x=price, y=carat)) +
geom_boxplot(aes(group=cut_number(price, 10)))
ggplot(smaller, aes(x=price, y=carat)) +
geom_boxplot(aes(group=cut_width(price, 1000)))
# create column for size
diamonds %>%
mutate(size=x*y*z) %>%
ggplot() +
geom_hex(aes(x=size, y=price))
For the most part, the price increases as size increases. There are a couple of points on this plot showing diamonds with very large size and relatively low price; the price of these outliers does not appear to depend solely on size.
ggplot(diamonds) +
geom_hex(aes(x=carat, y=price)) +
facet_wrap(. ~ cut)
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
The scatterplot is a better choice here because a binned plot tends to hide the outliers within the bins. For example, in the plot below you can still see the outliers, but they do not stand out as much as in the scatterplot.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = x, y = y, group=cut_number(x, 10))) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))